Text File | 1998-08-10 | 22KB | 437 lines

This is a collection of fast chunky-to-planar (c2p) routines implemented in Blitz Basic for use in any software, including commercial and shareware.

There are five standard c2p's, each of which has two versions: one that does a normal c2p operation, and one that does the c2p and a clearscreen at the same time (25-30% faster than a separate clearscreen). There is a sixth c2p that is of a different design and has special requirements.

c2p030only   : Only for use on 68030 cpu's. For 68030 users, this c2p will perform better than all the others.

c2p030onlyCLS: As above, except that it also clears (to a given longword) the chunky buffer that it has just read data from.

c2p040only   : Only for use on 68040 cpu's, but performs very well on anything higher. For 68040 users, this c2p will perform better than all the others.

c2p040onlyCLS: As above, except that it also clears (to a given longword) the chunky buffer that it has just read data from.

c2p060only   : Only for use on 68060 cpu's. It is not, however, the fastest and you will find that c2p040only and c2pCACHE are faster. Probably does not perform very well on anything lower than an 060.

c2p060onlyCLS: As above, except that it also clears (to a given longword) the chunky buffer that it has just read data from.

c2pGeneric   : A generic c2p for use on all cpu's if it is not possible to isolate the cpu model or to use separate c2p's for different cpu's. Performs well on 030 but is somewhat slower than dedicated routines on higher processors. Mainly to provide support for 030, as higher processors will be crippled.

c2pGenericCLS: As above, except that it also clears (to a given longword) the chunky buffer that it has just read data from.

c2p040plus   : A kind of generic routine for 040 or higher. Performs generally quite well on all 040 and 060 cpu's but is not as fast as dedicated c2p's. Not suitable for 030's.

c2p040plusCLS: As above, except that it also clears (to a given longword) the chunky buffer that it has just read data from.

c2pCACHE     : Designed for use on anything from a 68040 upwards. Second fastest on 040/25 but joint fastest on 060/50. Perhaps more geared towards 68060 than anything lower. This c2p is less flexible and requires special treatment.

c2pCACHECLS  : This does not exist. The routine handles the datacaches in a special way, and the additional writing to memory that a clearscreen performs would meddle with that.

Some general performance times for the routines are as follows. These times are inclusive of having a screen of the specified dimensions open and being displayed. The cpu's that each routine is suited to are given in brackets after its name (* indicates best suitability for the given c2p):

c2p030only (68030 only)
   On 68030/50MHz PAL        : * 320x200 @40.4fps    * 320x256 @30.7fps
   On 68040/25MHz DoublePAL  :   320x200 @42fps        320x256 @31fps
   On 68040/25MHz PAL        :   320x200 @44fps        320x256 @36.5fps

c2p030onlyCLS (68030 only)
   On 68040/25MHz PAL        :   320x200 @38.5fps      320x256 @29.7fps

c2p040only (68040 to 68060)
   On 68030/50MHz PAL        :   320x200 @28.2fps      320x256 @21.6fps
   On 68040/25MHz DoublePAL  : * 320x200 @49.6fps    * 320x256 @36.2fps
   On 68040/25MHz PAL        : * 320x200 @55.3fps    * 320x256 @42.5fps
   On 68060/50MHz PAL        : * 320x200 @66.1fps    * 320x256 @50fps

c2p040onlyCLS (68040 to 68060)
   On 68040/25MHz PAL        : * 320x200 @49fps      * 320x256 @37.1fps
                               (320x200 with a separate clearscreen ran about 45-46fps)

c2p060only (68060 only)
   On 68030/50MHz PAL        :   320x200 @27.9fps      320x256 @21.5fps
   On 68040/25MHz DoublePAL  :   320x200 @46.0fps      320x256 @34.2fps
   On 68040/25MHz PAL        :   320x200 @48.2fps      320x256 @37.4fps
   On 68060/50MHz PAL        : * 320x200 @66fps      * 320x256 @50fps

c2p060onlyCLS (68060 only)
   On 68040/25MHz PAL        :   320x200 @44.5fps      320x256 @33.8fps

c2pGeneric (all, but mainly 68030)
   On 68030/50MHz PAL        : * 320x200 @40.1fps    * 320x256 @30.7fps
   On 68040/25MHz DoublePAL  :   320x200 @42fps        320x256 @31fps
   On 68040/25MHz PAL        :   320x200 @44fps        320x256 @34fps

c2pGenericCLS (all, but mainly 68030)
   On 68040/25MHz PAL        :   320x200 @38.8fps      320x256 @29.7fps

c2p040plus (68040 to 68060)
   On 68030/50MHz PAL        :   320x200 @24.3fps      320x256 @18.5fps
   On 68040/25MHz DoublePAL  : * 320x200 @46fps      * 320x256 @34.2fps
   On 68040/25MHz PAL        : * 320x200 @49.2fps    * 320x256 @37.9fps
   On 68060/50MHz PAL        : * 320x200 @66fps      * 320x256 @50fps

c2p040plusCLS (68040 to 68060)
   On 68040/25MHz PAL        : * 320x200 @45.6fps    * 320x256 @35fps

c2pCACHE (68040 to 68060)
   On 68030/50MHz PAL        :   320x200 @23.5fps      320x256 @18.0fps
   On 68040/25MHz DoublePAL  : * 320x200 @47.1fps    * 320x256 @35.3fps
   On 68040/25MHz PAL        : * 320x200 @50fps      * 320x256 @38.3fps
   On 68060/50MHz PAL        : * 320x200 @66.1fps    * 320x256 @49.6fps

For 68030 owners, do not use c2p040plus, c2p040only, c2p060only, or c2pCACHE. These will give very bad performance on that cpu.

All of the routines except for c2pCACHE allow you to specify the size of the chunky-to-planar operation by way of a c2pRoutineInit{} statement, where `Routine' is the name of the routine (e.g. c2p040onlyInit{}). If you alter the size of the c2p operation you should generally also alter the size of your planar destination bitmap to match. It is, however, possible to have a planar bitmap that is taller than the chunky-to-planar operation, in which case #c2pBPLSIZE has to be altered to reflect this. The planar height must always be equal to or greater than the chunky height.

Each c2p routine has two inputs. The first parameter is the address of the chunky buffer and the second parameter is the address of the planar buffer. Planar memory must be contiguous, so I suggest initialising a bank or reserving some memory, and then using CludgeBitmap. The inputs to the init statements are the width and height of the chunky buffer, hence the size of the c2p operation. The init routine only needs to be called once in a program for any number of c2p calls.

c2pCACHE is different in that you must specify the operation size in constants, which cannot easily be altered while the program is running, so you are restricted to one size of operation per program run.

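For readers who have not met the conversion itself, here is a minimal and deliberately slow sketch of what a c2p operation computes. It is written in Python rather than Blitz or assembler and is not taken from any of the routines in this archive. The function name, the 8-plane depth and the layout of one whole bitplane after another in contiguous memory are assumptions made for illustration.

```python
def c2p_reference(chunky, width, height, planar_height=None, depth=8):
    """Slow reference chunky-to-planar conversion (illustration only).

    chunky: bytes-like, one byte per pixel, width * height pixels.
    Returns a bytearray holding `depth` bitplanes one after another
    (assumed layout), each (width // 8) * planar_height bytes long.
    """
    if planar_height is None:
        planar_height = height
    if width % 8 != 0:
        raise ValueError("width must be a multiple of 8 pixels")
    if planar_height < height:
        raise ValueError("planar height must be >= chunky height")

    bytes_per_row = width // 8
    plane_size = bytes_per_row * planar_height
    planar = bytearray(plane_size * depth)

    for y in range(height):
        for x in range(width):
            pixel = chunky[y * width + x]
            byte_index = y * bytes_per_row + x // 8
            bit = 0x80 >> (x % 8)  # leftmost pixel is the top bit of the byte
            for plane in range(depth):
                # Bit n of the pixel value goes into bitplane n.
                if pixel & (1 << plane):
                    planar[plane * plane_size + byte_index] |= bit
    return planar
```

When planar_height is larger than height, only the first height rows of each plane are written, which mirrors the rule that the planar bitmap may be taller than the c2p operation but never shorter.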
All you have to do to set up a c2p operation is something along these lines (for example):

   InitBank 2,320*256,$10000 ; Fastram chunky buffer
   InitBank 0,320*256,$10002 ; Chipram planar buffer
   CludgeBitmap 0,320,256,8,Bank(0)
   c2pGenericInit{320,256}
   c2pGeneric{Bank(2),Bank(0)}

Of course, replace the c2pGeneric statements with the ones for the relevant c2p that you are using.

The only exception to this is c2pCACHE. This requires that you cludge bitmaps to 8 bytes past the start of the planar buffer, and that you tell the c2pCACHE routine to output to an address 4 bytes past the start of the planar buffer. So you have to allow for this by reserving a little extra memory. Like this:

   InitBank 2,320*256,$10000 ; Fastram chunky buffer
   InitBank 0,(320*256)+8,$10002 ; Chipram planar buffer
   CludgeBitmap 0,320,256,8,Bank(0)+8
   c2pCACHE{Bank(2),Bank(0)+4}

As well as c2pCACHE having to be set up with constants, there is also no clearscreen version, because the way the c2p works makes it impossible to implement.

Generally you should ensure that the base address of a planar bitmap's bitplane data is aligned to the nearest 64 pixels (64 bits, i.e. 8 bytes). Reserving some memory with AllocMem or InitBank usually seems to do this very reliably. c2pCACHE requires that you create bitmaps at 8 bytes past the start of the data, and that you begin the c2p operation at 4 bytes past the start. This is to ensure that the data being displayed is 64-bit aligned, otherwise you would get a lower datafetch.

In amigamode, if the first longword of data that is being displayed is from a 64-bit aligned address, the o/s will use 64-bit datafetch, which means faster chunky-to-planar conversion. If you begin to scroll the display with hardware scrolling and you go beyond 32 pixels, the first longword being displayed will no longer be 64-bit aligned, and so the o/s will automatically switch to fetchmode 1 or 2 (32-bit datafetch), which will slow down the c2p.
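The address arithmetic behind the c2pCACHE setup and the 64-bit alignment rule can be sketched as follows. This is Python used purely as a calculator. The function names are hypothetical, and the sketch assumes an 8-bitplane bitmap and a bank that already starts on an 8-byte boundary.

```python
def align_up(address, boundary=8):
    """Round address up to the next multiple of boundary (a power of two)."""
    return (address + boundary - 1) & ~(boundary - 1)


def is_64bit_aligned(address):
    """True if address sits on an 8-byte (64-bit) boundary."""
    return address % 8 == 0


def c2pcache_layout(bank_start, width, height, depth=8):
    """Sizes and addresses for the c2pCACHE setup described in the text."""
    planar_bytes = (width // 8) * height * depth
    return {
        "reserve_bytes": planar_bytes + 8,       # InitBank 0,(320*256)+8,...
        "bitmap_address": bank_start + 8,        # CludgeBitmap ...,Bank(0)+8
        "c2p_output_address": bank_start + 4,    # c2pCACHE{...,Bank(0)+4}
    }
```

For a 320x256 screen in 8 bitplanes the planar data is (320/8)*256*8 = 81920 bytes, the same as 320*256, so 81928 bytes are reserved. If the bank starts 64-bit aligned, the bitmap at 8 bytes past the start is still 64-bit aligned.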
Fetchmode 0 is more horrific still. The o/s will not normally use it, but if YOU set the datafetch yourself, make sure you do NOT use datafetch 0, because that will at least double the time it takes to do the c2p operation, and that is bad news.

To do scrolling with chunky screens it is not normally the best idea to use hardware scrolling. The c2p's do not have a line modulo, so you would have to make your planar bitmap 64 pixels wider, which means a further 64x200 or 64x256 area to be converted. This is also a waste because one longword in chunky is only 4 pixels, so a hardware scroll of 0..3 is normally all that is required, and the remaining 60 pixels are a total waste. As such, I recommend using software scrolling. Generally speaking, if you have enough power to use chunky-to-planar well then you should also be thinking of refreshing the whole screen every frame rather than using any of the traditional scroll methods.

Taking a leap to using chunky is also to take a leap towards other factors which come as part of the package. Screens are normally fully refreshed each frame, scrolling is done in software, blits are done with the cpu, and generally there is cpu horsepower to back this all up. 030/50's are generally going to be a little limited in what can be achieved with a decent screensize. I suggest 040/25 is the entry level for software that uses chunky-to-planar, unless you have direct output to a graphics card, which does not require any data conversion.

For a purely generic setup, use the c2pGeneric routine. It will, however, be quite crippling to 040 or 060 processors but will better support the low end. To take things one stage further, also use c2p040plus, which is a generic routine for 040-060 cpu's. To take it to the next level you should be looking to have a specific routine for each cpu. For 030's use c2p030only and for 040's use c2p040only.
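The three levels of support just described can be written down as a small lookup. This is a sketch in Python, not code from the archive: the function name and the numeric level argument are hypothetical. Mapping the 68060 to c2p040only follows the routine descriptions, which say it performs very well on anything higher than an 040, and falling back to c2pGeneric for cpu's below the 030 is an assumption.

```python
def choose_c2p(cpu, level):
    """Pick a c2p routine name for a cpu model number such as 68030.

    level 1: one generic routine for everything
    level 2: generic routine, plus c2p040plus for 040 and upwards
    level 3: a dedicated routine for each cpu
    """
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2 or 3")
    if level == 1:
        return "c2pGeneric"
    if level == 2:
        return "c2p040plus" if cpu >= 68040 else "c2pGeneric"
    dedicated = {
        68030: "c2p030only",
        68040: "c2p040only",
        68060: "c2p040only",  # reported as at least as fast as c2p060only
    }
    # Assumption: anything without a dedicated routine gets the generic one.
    return dedicated.get(cpu, "c2pGeneric")
```

A program would call this once at startup, after detecting the cpu, and then use the matching Init and c2p statements for the rest of the run.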
It seems that c2p040only is actually faster than c2p060only when running on a 68060/50, but there is hardly anything in it so take your pick. c2pCACHE is another possible replacement for c2p040plus and is faster but less flexible. Certainly you don't need to support ALL of the c2p's in your software, as there is quite a lot of overlap and it may come down to personal taste. Personally I would use c2p030only for anything below 040, and c2p040only for anything from 040 upwards. If I had to choose ONE generic c2p I would go for c2p040plus, as it performs slightly better than c2pCACHE when used on 030's, although either of them on 030 is pretty poor, so I would generally target the software at 040 upwards.

Generally speaking, the clearscreen routines save you between 3 and 5 frames per second compared with having to do a separate clearscreen routine. Time is mainly gained by minor pipelining and the fact that the c2p routine is already handling and setting up the loop. All that has been done to facilitate the clearscreen is that (in most cases) a7 is loaded with #clearscreento, which is a longword, and then move.l (a0)+,Dn is converted to move.l (a0),Dn : move.l a7,(a0)+ ; or, if it is the 030only or generic routine, it has been converted to move.l (a0),Dn : move.l #clearscreento,(a0)+, because those routines do not have a7 spare.

It is possible to do a screencopy at the same time as the c2p, but this is not feasible on c2p030only or c2pGeneric as there needs to be a spare register. Therefore, a screencopy in place of the clearscreen (which effectively also does the same job as a clearscreen) is only viable on 040 upwards, and judging by the time it takes it may be better suited to 060 only. It may well be faster to do a separate screencopy, perhaps hardcoded and perhaps using move16, which may equal or surpass the time that might be saved by doing the screencopy at the same time as the c2p.
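The read-then-overwrite idea behind the clearscreening c2p's, and behind the screencopy variant, can be sketched like this. It is a Python model of the data flow only, with a hypothetical function name. The real routines work on longwords in registers and gain their speed from doing the write inside the existing c2p loop.

```python
def fetch_and_refill(chunky, refill):
    """Yield each longword of `chunky` and overwrite it in the same pass.

    chunky: list of 32-bit values, modified in place.
    refill: an int (the clear-to longword) or a sequence as long as
            chunky (a background buffer to copy in instead).
    """
    for i in range(len(chunky)):
        value = chunky[i]            # like move.l (a0),Dn
        if isinstance(refill, int):
            chunky[i] = refill       # like move.l a7,(a0)+ (clearscreen)
        else:
            chunky[i] = refill[i]    # screencopy: refill from a second buffer
        yield value                  # the value goes on to be converted
```

Each longword is read once and its slot is written once, so the buffer is ready for the next frame by the time the conversion of the current frame has finished.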
I HAVE done a screencopy test using c2p040only, in which move.l (a0)+,Dn has been converted to move.l (a0),Dn : move.l (a7)+,(a0)+ ; and it seems to perform @44.3fps for 320x200, or @34.1fps for 320x256 (040/25 results). This costs about a further 3 frames per second on top of the c2pCLS time, or about 9-10fps for the copy compared with a c2p that does not do anything additional (c2p040only). If you can do a screencopy separately, perhaps using movem or move16, faster than this on 040/25, then I suggest you do that rather than modify the c2p. Judging by the time it takes and the number of chunky blits it gobbles up, I would suggest that fullscreen copy is not very viable on anything lower than 040 and is probably questionable on 040/25.

If you have a horizontal strip at the top or bottom or even middle of your display that does not need to be clearscreened and yet is updated a lot, use a clearscreening c2p for the main game area and a non-clearscreening one for the panel area.

When it comes to chunky blitting you need to take into account the processor you are working on. If you have anything from 68040 upwards it is faster to have mask data (same size as the graphic) and to write longwords to non-aligned addresses than to try generating the mask on-the-fly. The code: move.l (a2)+,d0 : and.l d0,(a1) : move.l (a0)+,d1 : or.l d1,(a1)+ ; will do one longword of masked blit to anywhere on the screen, about 2-3 times faster than if you try to generate the mask from the source data. Also, writing longwords to odd (byte) addresses is not supported on the 68000, which raises an address error; the 68020 and above do allow it. If you do it with a copyback cache it is very quick, so that masked blits are only about 30% slower than unmasked ones. If you are not going to write to non-aligned addresses you have to do shifting or rotating in the cpu, which if using mask data means the mask as well. This takes further time.
But these memory-intensive methods may not prove to be quite so efficient on 030's, as they do not have a copyback cache.

I did not write any of the c2p's myself, only the minor modifications and the example program. You are free to use them all in any of your productions: freeware, shareware, and even commercialware. I hope you are thankful to those talented few who have mastered their craft in making these c2p's and released them for public use.

Please find also enclosed in this archive a demonstration program. There is an 040 version for 040-to-060, and an 030 version. This program will use a clearscreening c2p and will bounce a number of chunky cpu-blitted objects around the screen. The blit routines do not do any clipping, and the loop for movement and rendering of the objects is currently hardcoded into a single statement. This is quite a bit faster than calling a statement for every object.

There are some constants which you can alter. The demonstration program has the facility to have a planar bitmap height larger than the chunky bitmap height. You must not allow it to be smaller, however! #planarheight should be >= #c2pBPLY. If you use: #c2pBPLY=200 : #planarheight=256 ; the routine will do a c2p operation on the first 200 lines and leave the bottom 56 lines as they are. This shows how you can use the vertical modulo should you need to.

There are other constants to alter. #iterations is how many loops will be done before the program exits. #objcount is the number of cpu-blit objects that will be moved and drawn. Refer to the example 040/25 results for guidelines as to what to set this at. #objwidth and #objheight are the size of the objects. They must not be larger than the size of the chunky buffer, and preferably should be at least about 16 pixels smaller in both dimensions for the movement routine to work properly. #objwidth must be a multiple of 4 and must not be smaller than 4.
#objheight does not need to be a multiple of anything but must not be smaller than 1. The routine currently has constants which will render 85 32x32 256-colour masked objects with a screen size of 320x240. Use PAL in preference to DoublePAL if you want a higher framerate. 320x200 will yield even higher results. If you alter the chunky height don't forget about the planarheight.

#objmasking should be set to either 0 or nonzero, which means you can use anything other than 0 (1, -1, 20, -50, etc). If it is zero, there will be no masking performed and you will attain higher output, but all objects will be solid. If #objmasking is nonzero, there will be masking and any zero pixels will be transparent. The routine will default to using masking.

The mask routine uses a prerendered mask image, similar to planar masks, except that it is a byte for every pixel. This is unavoidable if there is to be such speedy processing. It is perfectly feasible to generate a mask that differs from the graphic data so that any number of colours will be transparent. Don't forget that if you specify an area as solid when it is blank in the graphic, the blank pixel will be drawn. There is very little difference between the masked and unmasked routines. You will notice that the masked routine could be simplified as and.l (a2)+,(a1) : or.l (a0)+,(a1)+ ; but this is illegal in 68000 assembler (and and or cannot go from memory to memory), so I have had to expand it a little.

Currently both routines allow total flexibility in terms of width and height (width to the nearest 4 pixels), and so use an x loop nested inside a y loop. If you unroll the x loop for a hardcoded version you will get more output, and similarly with the y loop, although hardcoded large objects take up too much space to work in cache. Whereas it is possible to do 900 8x8 masked objects on 040/25, it is possible to do 1100 if the routine is hardcoded for 8x8 with both loops fully unfolded (ie no loops).
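A minimal sketch of the and/or masked blit described above, in Python rather than assembler. The function names are hypothetical. The mask polarity ($FF where the graphic is transparent, $00 where it is solid) is inferred from the and-then-or form of the routine and from zero pixels being transparent by default.

```python
def masked_blit_longword(dest, mask, source):
    """One longword of masked blit: (dest AND mask) OR source."""
    return ((dest & mask) | source) & 0xFFFFFFFF


def make_mask(source_pixels):
    """Byte-per-pixel mask: $FF where the graphic is 0, otherwise $00."""
    return bytes(0xFF if p == 0 else 0x00 for p in source_pixels)


def masked_blit_row(dest, source, mask):
    """Blit one row of chunky pixels, one byte per pixel."""
    return bytes((d & m) | s for d, m, s in zip(dest, mask, source))
```

With this polarity, a pixel marked solid in the mask but blank (zero) in the graphic clears the destination and then ORs in zero, so the blank pixel is drawn, as noted above.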
The larger the objects you use, the less intermediary time is used in setting them up. Lots of small objects take a lot of processing of the movement table. Typically, on 040/25, there is time to do about two and a half 320x200 screenfuls of blitting in the time left after the clearscreening c2p. The objects that the demo routine uses are generated when you run the program, so they are basically just random pixels and the palettes are fairly random too.

I have also added a second demo program, which is the same as the first except that it is dynamic in the number of objects that it displays. You set it a target frames-per-second rate that you want to know results for. You tell it how many objects to start off with (must be greater than 0) and how many objects to add each time (greater than 0). Iterations in the second demo represent how many loops to do before adding more objects, and this should not be too low or the routine won't work out properly whether it has reached the target framerate or not. You have to set a maximum number of objects, because the table has to be initialised for the eventuality of that many objects being displayed. I set it to 3000 initially, which is more than enough for most usual object sizes on all cpu's. There are versions of the demo for 030 and for 040(+) as before.

You set the program running and it will progressively add objects and move them, and will keep doing so until it reaches the target framerate. Then it will tell you precisely what the framerate was at the end and how many objects it managed to display at that rate. So instead of having to do tests on some constant number of objects, you can let the routine chug away adding more and more until you are maxed out for the selected framerate. The demo will default to doing 16x16 objects, starting with 10, and adding 10 more every 40 loops.
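The control loop of the second demo can be sketched as follows. This is a Python outline of the idea only, not the Blitz source. measure_fps is a hypothetical callback standing in for running the batch of iterations and timing it, and the defaults simply echo the figures given above.

```python
def find_max_objects(measure_fps, target_fps, start=10, step=10,
                     max_objects=3000):
    """Add objects until the measured framerate falls to the target.

    measure_fps(count) is a hypothetical callback: it runs one batch of
    loops with `count` objects and returns the frames per second.
    Returns (object_count, fps) for the last batch measured.
    """
    if start <= 0 or step <= 0:
        raise ValueError("start and step must be greater than 0")
    if start > max_objects:
        raise ValueError("start must not exceed max_objects")
    count = start
    fps = measure_fps(count)
    # Keep adding objects while there is framerate to spare and room
    # left in the object table.
    while fps > target_fps and count + step <= max_objects:
        count += step
        fps = measure_fps(count)
    return count, fps
```

Checking the starting count against the maximum before the loop begins is the guard that the real demo leaves to the user.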
You can set the starting number of objects (objcount) to a value much closer to what you expect the end result to be, in order to hasten the report. The starting number of objects should never exceed the maximum number, however, or it will probably start drawing objects everywhere in memory (65536x65536!).

Unless you are particularly fussy you do NOT need to have a planar doublebuffer when using the c2p's, so long as they are running quite fast, ie the overall routine is not slowing below 25fps, or not by much anyway. There will perhaps be a slight flicker on one or two lines of the display, but it will not be the full-screen type of flickering that you can get on planar. Yes, the c2p's are outputting to planar, but the way they do it seems to minimise flicker. I personally do not use a doublebuffer and I hardly notice that I haven't. When you're in the middle of all the action you won't notice either, so long as things don't slow down too much. Even if the overall routine slows down, the c2p will still take the same amount of time, so it should be okay. So you can probably cut out the time it takes to do screen swapping or other doublebuffer methods. Of course, if you have a graphics card, it is fairly normal to refresh straight into the display, and people have reported that there is little or no flicker whatsoever.

I would be interested to know how any of these routines perform on your specification of Amiga, and particularly how well the clearscreening c2p's do on 68060's, ie does it clearscreen `for free'.

Problems or ideas, give me a yell. If you get any problems implementing, using, adapting or modifying the routines, email me at paul@stationone.demon.co.uk

Enjoy.